Rows: 1,338
Columns: 7
$ age <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…
---
title: "Assignment 7"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "hotpink"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
```
Overall Analysis
===
Column {data-width=1000}
---
```{r}
```
Insurance Dataset
===
Column {data-width=1000}
---
### A glimpse
```{r}
# Machine-specific path: adjust to wherever insurance.csv is stored
insurance <- read_csv("~/Desktop/MTH209/Labs/insurance.csv")
glimpse(insurance)
```
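The dataset contains 1,338 observations of 7 variables: four numeric (`age`, `bmi`, `children`, `charges`) and three character (`sex`, `smoker`, `region`).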
By Region
===
Column {data-width=300}
---
### Insurance Claims by Region
```{r}
insurance %>% ggplot(aes(x=region)) + geom_bar(fill = "magenta") + labs(title="Number of Health Insurance Claims by Region", x="Region", y="Count")
```
### Notes
Insurance claims appear to be distributed fairly evenly across the four regions, although the Southeast region has slightly more than the other three.
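A quick numeric check can back up the visual impression from the bar chart. The following is a minimal sketch, assuming a small made-up `region` vector in place of `insurance$region`; it tabulates the count and the share of records per region using base R only.

```r
# Made-up stand-in for insurance$region (illustrative values, not real data)
region <- c("southeast", "southeast", "southeast", "southwest", "southwest",
            "northwest", "northwest", "northeast", "northeast", "northeast")

region_counts <- table(region)                         # records per region
region_share  <- round(prop.table(region_counts), 2)   # share of all records

print(region_counts)
print(region_share)
```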
Column {data-width=300}
---
### Smoking by Region
```{r}
# Share of smokers within each region (proportions sum to 1 per region)
smoker_proportions <- insurance %>%
  group_by(region, smoker) %>%
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(proportion = count / sum(count))

ggplot(smoker_proportions, aes(x = region, y = proportion, fill = smoker)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Proportion of Smokers by Region",
       x = "Region",
       y = "Percentage",
       fill = "Smoker")
```
### Notes
Interestingly, the Southeast region also has a slightly higher proportion of smokers than the other three regions.
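Because the plot shows proportions rather than raw counts, the comparison is between smoking rates within each region. As a hedged sketch, assuming made-up `region` and `smoker` vectors in place of the real columns, the rate per region could be computed like this:

```r
# Made-up stand-ins for the region and smoker columns (not real data)
region <- c(rep("southeast", 4), rep("northwest", 4))
smoker <- c("yes", "yes", "no", "no", "yes", "no", "no", "no")

# The mean of a logical vector is the proportion of TRUE values
smoker_rate <- tapply(smoker == "yes", region, mean)
print(smoker_rate)
```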
Column {data-width=300}
---
### BMI by Region
```{r}
insurance %>% ggplot(aes(x=region, y=bmi)) + geom_boxplot(color = "magenta", fill = "limegreen") + labs(title="BMI based on Region", x="Region", y="BMI")
```
### Notes
The boxplots suggest that BMI is also highest in the Southeast region.
BMI Distribution
===
Column {data-width=1}
---
### Distribution of BMI
```{r}
insurance %>% ggplot(aes(x=bmi)) + geom_histogram(bins=30, color="hotpink", fill="magenta") + labs(title="BMI Distribution", x="BMI", y="Count")
```
Column {data-width=1}
---
### Notes
BMI appears to be approximately normally distributed, with a very slight right skew.
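Skew can also be quantified rather than judged by eye. The sketch below uses the moment-based skewness coefficient, which is near 0 for symmetric data and positive for right-skewed data. The `skewness` helper is hypothetical (base R does not ship one) and the two vectors are made up for illustration; with the real data, `insurance$bmi` would take their place.

```r
# Hypothetical helper: base R has no built-in skewness function
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)    # made-up values, symmetric around 3
right_skewed <- c(1, 1, 2, 2, 10)   # made-up values with a long right tail

print(skewness(symmetric))      # 0 for perfectly symmetric data
print(skewness(right_skewed))   # positive, indicating right skew
```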
Insurance Charges
===
Column {data-width=1}
---
### Distribution of Charges
```{r}
insurance %>% ggplot(aes(x=charges)) + geom_histogram(bins=30, color="hotpink", fill="magenta") + labs(title="Insurance Charge Distribution", x="Charges", y="Count")
```
### Notes
The distribution of charges is not normal: it has a heavy right skew.
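One simple symptom of a heavy right skew is a mean that sits well above the median, and a log transform is a common way to compress the long tail. The following is a rough sketch on made-up charges (not values from the dataset) comparing the mean-to-median ratio before and after a `log10` transform.

```r
# Made-up charges with one very large value (not real data)
charges <- c(1500, 2000, 3000, 4500, 6000, 9000, 40000)

# In a right-skewed distribution the mean is pulled above the median
raw_ratio <- mean(charges) / median(charges)

# A log10 transform compresses the long right tail
log_ratio <- mean(log10(charges)) / median(log10(charges))

print(c(raw = raw_ratio, log10 = log_ratio))
```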
Column {data-width=1}
---
### Insurance Charges by Age
```{r}
insurance %>% ggplot(aes(x=age, y=charges)) + geom_point(color="magenta") + labs(title="Age vs. Insurance Claim Charge", x="Age", y="Charge")
```
### Notes
Most points on the scatter plot are bunched towards the bottom of the plot, meaning most charges are relatively low. However, the claim amount still increases steadily with age.
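The upward trend with age can be summarised with a straight-line fit and a correlation coefficient. This is only a sketch on made-up numbers; the `toy` data frame is an assumption standing in for the `age` and `charges` columns.

```r
# Made-up ages and charges (not real data)
toy <- data.frame(
  age     = c(20, 30, 40, 50, 60),
  charges = c(6100, 8400, 11000, 13600, 15900)
)

fit   <- lm(charges ~ age, data = toy)   # straight-line fit of charges on age
slope <- coef(fit)[["age"]]              # average change in charges per year of age
r     <- cor(toy$age, toy$charges)       # strength of the linear association

print(c(slope = slope, correlation = r))
```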
Column {data-width=1}
---
### Insurance Charges by Age and Smoker Status
```{r}
insurance %>% ggplot(aes(x=age, y=charges, color=smoker)) + geom_point() + labs(title="Age vs. Insurance Claim Amount, Depending on Smoker Status", x="Age", y="Claim Amount")
```
### Notes
There is a clear difference between the claim amounts of smokers and non-smokers. Smokers generally make larger claims, and the claim amount increases as age increases. Non-smokers generally make smaller claims, but the claim amount still increases with age.
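The pattern described here, a shared increase with age plus a roughly constant gap between smokers and non-smokers, corresponds to an additive linear model. The sketch below assumes made-up data constructed to have exactly that shape, so the fitted coefficients simply recover the numbers used to build it; it is not a fit to the real dataset.

```r
# Made-up data: both groups rise by 200 per year of age, smokers sit 20000 higher
age    <- rep(c(20, 30, 40, 50, 60), times = 2)
smoker <- rep(c("no", "yes"), each = 5)
toy    <- data.frame(age, smoker,
                     charges = 1000 + 200 * age + 20000 * (smoker == "yes"))

# Additive model: one shared age slope plus a constant shift for smokers
fit <- lm(charges ~ age + smoker, data = toy)
print(round(coef(fit), 2))
```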
Validity of Trendlines
===
Column {data-width=1}
---
### Trendline 1
```{r}
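# Split once by smoker status; the nonsmoker subset is reused in the next chunk.
# Inside filter(), `smoker` refers to the column, not the new data frame.
smoker <- insurance %>% filter(smoker == "yes")
nonsmoker <- insurance %>% filter(smoker == "no")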
smoker %>% ggplot(aes(x=age, y=charges)) + geom_point(color="red") + geom_smooth() + labs(title="Insurance Charges for Smokers", x="Age", y="Charges")
```
### Notes
In this scenario, the trendline is not a valid representation of the data. There are clearly two groups in this plot, and the trendline does not accurately represent either of them.
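To illustrate why a single trendline misrepresents two separate groups, the sketch below compares one pooled line against a fit that allows a separate line per group. The data are made up, and `band` is a purely hypothetical label: the plot does not say which variable, if any, separates the two groups of smokers.

```r
# Made-up data: two parallel bands of charges; `band` is a hypothetical label
age  <- rep(c(20, 30, 40, 50, 60), times = 2)
band <- rep(c("low", "high"), each = 5)
toy  <- data.frame(age, band,
                   charges = 15000 + 250 * age + 20000 * (band == "high"))

pooled  <- lm(charges ~ age, data = toy)          # one line through both bands
by_band <- lm(charges ~ age + band, data = toy)   # parallel lines, one per band

# Typical distance of the points from each fit
print(c(pooled = sd(resid(pooled)), by_band = sd(resid(by_band))))
```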
Column {data-width=1}
---
### Trendline 2
```{r}
nonsmoker %>% ggplot(aes(x=age, y=charges)) + geom_point(color="red") + geom_smooth() + labs(title="Insurance Charges for Nonsmokers", x="Age", y="Charges")
```
### Notes
In this scenario, the trendline does represent the data fairly well. There are a few outliers, but the trendline is fairly accurate for most members of the group.
To separate the data further, splitting by sex (the `sex` column) could be informative. Splitting by disease status (for example, cancer vs. non-cancer) might also help, but this dataset has no such variable, so it would require additional data.
Number of Children and Insurance Charges
===
Column {data-width=1}
---
### Distribution of Number of Children
```{r}
children_counts <- insurance %>%
  count(children)

ggplot(children_counts, aes(x = "", y = n, fill = as.factor(children))) +
  geom_col() +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Number of Children",
       fill = "Number of Children",
       x = NULL, y = NULL) +
  theme_void() +
  theme(legend.position = "right")
### Notes
The largest share of insurance claims comes from people with zero children, and the number of claims falls as the number of children rises.
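Shares are hard to read off a pie chart, so a small table can make the comparison explicit, including whether the largest group is a plurality or an outright majority. The sketch below assumes a made-up `children` vector in place of `insurance$children`.

```r
# Made-up stand-in for insurance$children (not real data)
children <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3)

share   <- prop.table(table(children))   # share of records per number of children
largest <- names(which.max(share))       # group with the largest share

print(share)
print(largest)
print(max(share) > 0.5)   # FALSE here: a plurality, but not a majority
```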
Column {data-width=1}
---
### Distribution of Charges by Number of Children
```{r}
insurance %>% ggplot(aes(x=as.factor(children), y=charges)) + geom_boxplot(fill="skyblue", color="blue") + labs(title="Distribution of Charges Based on Number of Children", x="Number of Children", y="Charges")
```
### Notes
People with many children (4 or 5) make the fewest very expensive insurance claims: their boxplots show the fewest outliers. The zero-children group, by contrast, has many outliers. People with 1, 2, or 3 children make similar claims.
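The outlier comparison can be made concrete by counting the points each boxplot would flag. The sketch below uses base R's `boxplot.stats()`, which applies a 1.5 times box-length rule around the hinges; its cutoffs may differ slightly from those drawn by `geom_boxplot()`, so treat the counts as approximate. The data and the `count_outliers` helper are made up for illustration.

```r
# Made-up charges for two groups of records (not real data)
children <- rep(c("0", "4"), each = 6)
charges  <- c(1000, 2000, 3000, 4000, 5000, 50000,   # group "0": one extreme value
              1000, 2000, 3000, 4000, 5000, 6000)    # group "4": no extreme values

# boxplot.stats() flags points more than 1.5 box-lengths beyond the hinges
count_outliers <- function(x) length(boxplot.stats(x)$out)
outlier_counts <- tapply(charges, children, count_outliers)
print(outlier_counts)
```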